CS236781: Deep Learning on Computational Accelerators

Homework Assignment 4

Faculty of Computer Science, Technion.

Submitted by:

# Name Id email
Student 1 Nitzan Madar 203483334 NitzanMadar@Campus.Technion.ac.il
Student 2 Guy Bar-Shalom 313537896 Guy.B@Campus.Technion.ac.il
Student 3 Hod Paska 204441810 Hodp@Campus.Technion.ac.il

Introduction

In this assignment we'll explore deep reinforcement learning. We'll implement two popular and related methods for directly learning the policy of an agent for playing a simple video game.

General Guidelines

Contents

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\cset}[1]{\mathcal{#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} \newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]} \newcommand{\ip}[3]{\left<#1,#2\right>_{#3}} \newcommand{\given}[]{\,\middle\vert\,} \newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)} \newcommand{\grad}[]{\nabla} $$

Part 1: Deep Reinforcement Learning

In the tutorial we have seen value-based reinforcement learning, in which we learn to approximate the action-value function $q(s,a)$.

In this exercise we'll explore a different approach, directly learning the agent's policy distribution, $\pi(a|s)$ by using policy gradients, in order to safely land on the moon!

Some technical notes before we begin:

Policy gradients

Recall from the tutorial that we define the policy of an agent as the conditional distribution, $$ \pi(a|s) = \Pr(a_t=a\vert s_t=s), $$ which defines how likely the agent is to take action $a$ at state $s$.

Furthermore we define the action-value function, $$ q_{\pi}(s,a) = \E{g_t(\tau)|s_t = s,a_t=a,\pi} $$ where $$ g_t(\tau) = r_{t+1}+\gamma r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k}, $$ is the total discounted reward of a specific trajectory $\tau$ from time $t$, and the expectation in $q$ is over all possible trajectories, $ \tau=\left\{ (s_0,a_0,r_1,s_1), \dots (s_T,a_T,r_{T+1},s_{T+1}) \right\}. $

In the tutorial we saw that we can learn a value function starting with some random function and updating it iteratively by using the Bellman optimality equation. Given that we have some action-value function, we can immediately create a policy based on that by simply selecting an action which maximize the action-value at the current state, i.e. $$ \pi(a|s) = \begin{cases} 1, & a = \arg\max_{a'\in\cset{A}} q(s,a') \\ 0, & \text{else} \end{cases}. $$ This is called $q$-learning. This approach aims to obtain a policy indirectly through the action-value function. Yet, in most cases we don't actually care about knowing the value of particular states, since all we need is a good policy for our agent.

Here we'll take a different approach and learn a policy distribution $\pi(a|s)$ directly - by using policy gradients.

Formalism

We define a parametric policy, $\pi_\vec{\theta}(a|s)$, and maximize total discounted reward (or minimize the negative reward): $$ \mathcal{L}(\vec{\theta})=\E[\tau]{-g(\tau)|\pi_\vec{\theta}} = -\int g(\tau)p(\tau|\vec{\theta})d\tau, $$ where $p(\tau|\vec{\theta})$ is the probability of a specific trajectory $\tau$ under the policy defined by $\vec{\theta}$.

Since we want to find the parameters $\vec{\theta}$ which minimize $\mathcal{L}(\vec{\theta})$, we'll compute the gradient w.r.t. $\vec{\theta}$: $$ \grad\mathcal{L}(\vec{\theta}) = -\int g(\tau)\grad p(\tau|\vec{\theta})d\tau. $$

Unfortunately, if we try to write $p(\tau|\vec{\theta})$ explicitly, we find that computing it's gradient with respect to $\vec{\theta}$ is quite intractable due to a huge product of terms depending on $\vec{\theta}$: $$ p(\tau|\vec{\theta})=p\left(\left\{ (s_t,a_t,r_{t+1},s_{t+1})\right\}_{t\geq0}\given\vec{\theta}\right) =p(s_0)\prod_{t\geq0} \pi_{\vec{\theta}}(a_t|s_t)p(s_{t+1}|s_t,a_t). $$

However, by using the fact that $\grad_{x}\log(f(x))=\frac{\grad_{x}f(x)}{f(x)}$, we can convert the product into a sum: $$ \begin{align} \grad\mathcal{L}(\vec{\theta}) &= -\int g(\tau)\grad p(\tau|\vec{\theta})d\tau = -\int g(\tau)\frac{\grad p(\tau|\vec{\theta})}{p(\tau|\vec{\theta})}p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\log\left(p(\tau|\vec{\theta})\right)p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\log\left( p(s_0)\prod_{t\geq0} \pi_{\vec{\theta}}(a_t|s_t)p(s_{t+1}|s_t,a_t) \right) p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\left( \log p(s_0) + \sum_{t\geq0} \log \pi_{\vec{\theta}}(a_t|s_t) + \sum_{t\geq0}\log p(s_{t+1}|s_t,a_t) \right) p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t) p(\tau|\vec{\theta})d\tau \\ &= \E[\tau]{-g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t)}. \end{align} $$

This is the "vanilla" version of the policy gradient. We can interpret is as a weighted log-likelihood function. The log-policy is the log-likelihood term we wish to maximize and the total discounted reward acts as a weight: high-return positive trajectories will cause the probability of actions taken during them to increase, and negative-return trajectories will cause the probabilities of actions taken to decrease.

In the following figures we see three trajectories: high-return positive-reward (green), low-return positive-reward (yellow) and negative-return (red) and the action probabilities along the trajectories after the update. Credit: Sergey Levine.

The major drawback of the policy-gradient is it's high variance, which causes erratic optimization behavior and therefore slow convergence. One reason for this is that the log-policy weight term, $g(\tau)$ can vary wildly between different trajectories, even if they're similar in actions. Later on we'll implement the loss and explore some methods of variance reduction.

Landing on the moon with policy gradients

In the spirit of the recent achievements of the Israeli space industry, we'll apply our reinforcement learning skills to solve a simple game called LunarLander.

This game is available as an environment in OpenAI gym.

In this environment, you need to control the lander and get it to land safely on the moon. To do so, you must apply bottom, right or left thrusters (each are either fully on or fully off) and get it to land within the designated zone as quickly as possible and with minimal wasted fuel.

The observations at each step is the Lander's position, velocity, angle, angular velocity and ground contact state. The actions are no-op, fire left truster, bottom thruster and right thruster.

You are highly encouraged to read the documentation in the source code of the LunarLander environment to understand the reward system, and see how the actions and observations are created.

Policy network and Agent

Let's start with our policy-model. This will be a simple neural net, which should take an observation and return a score for each possible action.

TODO:

  1. Implement all methods in the PolicyNet class in the hw4/rl_pg.py module. Start small. A simple MLP with a few hidden layers is a good starting point. You can come back and change it later based on the the experiments.
    Notice that we'll use the build_for_env method to instantiate a PolicyNet based on the configuration of a given environment.
  2. If you need hyperparameters to configure your model (e.g. number of hidden layers, sizes, etc.), add them in part1_pg_hyperparams() in hw4/answers.py.

Now we need an agent. The purpose of our agent will be to act according to the current policy and generate experiences. Our PolicyAgent will use a PolicyNet as the current policy function.

We'll also define some extra datatypes to help us represent the data generated by our agent. You can find the Experience, Episode and TrainBatch datatypes in the hw4/rl_data.py module.

TODO: Implement the current_action_distribution() method of the PolicyAgent class in the hw4/rl_pg.py module.

TODO: Implement the step() method of the PolicyAgent.

To test our agent, we'll write some code that allows it to play an environment. We'll use the Monitor wrapper in gym to generate a video of the episode for visual debugging.

TODO: Complete the implementation of the monitor_episode() method of the PolicyAgent.

To display the Monitor video in this notebook, we'll use a helper function from our jupyter_utils and a small wrapper that extracts the path of the last video file.

Training data

The next step is to create data to train on. We need to train on batches of state-action pairs, so that our network can learn to predict the actions.

We'll split this task into three parts:

  1. Generate a batch of Episodes, by using an Agent that's playing according to our current policy network. Each Episode object contains the Experience objects created by the agent.
  2. Calculate the total discounted reward for each state we encountered and action we took. This is our action-value estimate.
  3. Convert the Episodes into a batch of tensors to train on. Each batch will contain states, action taken per state, reward accrued, and the calculated estimated state-values. These will be stored in a TrainBatch object.

TODO: Complete the implementation of the episode_batch_generator() method in the TrainBatchDataset class within the hw4.rl_data module. This will address part 1 in the list above.

TODO: Complete the implementation of the calc_qvals() method in the Episode class. This will address part 2. These q-values are an estimate of the actual action value function: $$\hat{q}_{t} = \sum_{t'\geq t} \gamma^{t'-t}r_{t'+1}.$$

TODO: Complete the implementation of the from_episodes() method in the TrainBatch class. This will address part 3.

Notes:

Loss functions

As usual, we need a loss function to optimize over. We'll calculate three types of losses:

  1. The causal vanilla policy gradient loss.
  2. The policy gradient loss, with a baseline to reduce variance.
  3. An entropy-based loss whos purpose is to diversify the agent's action selection, and prevent it from being "too sure" about its actions. This loss will be used together with one of the above losses.

Causal vanilla policy-gradient

We have derived the policy-gradient as $$ \grad\mathcal{L}(\vec{\theta}) = \E[\tau]{-g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t)}. $$

By writing the discounted reward explicitly and enforcing causality, i.e. the action taken at time $t$ can't affect the reward at time $t'<t$, we can get a slightly lower-variance version of the policy gradient:

$$ \grad\mathcal{L}_{\text{PG}}(\vec{\theta}) = \E[\tau]{-\sum_{t\geq0} \left(\sum_{t'\geq t} \gamma^{t'}r_{t'+1} \right)\grad\log \pi_{\vec{\theta}}(a_t|s_t)}. $$

In practice, the expectation over trajectories is calculated using a Monte-Carlo approach, i.e. simply sampling $N$ trajectories and average the term inside the expectation. Therefore, we will use the following estimated version of the policy gradient:

$$ \begin{align} \hat\grad\mathcal{L}_{\text{PG}}(\vec{\theta}) &=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\sum_{t'\geq t} \gamma^{t'}r_{i,t'+1} \right)\grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}) \\ &=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \hat{q}_{i,t} \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). \end{align} $$

Note the use of the notation $\hat{q}_{i,t}$ to represent the estimated action-value at time $t$ in the sampled trajectory $i$. Here $\hat{q}_{i,t}$ is acting as the weight-term for the policy gradient.

TODO: Complete the implementation of the VanillaPolicyGradientLoss class in the hw4/rl_pg.py module.

Policy-gradient with baseline

Another way to reduce the variance of our gradient is to use relative weighting of the log-policy instead of absolute reward values. $$ \hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-b\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$ In other words, we don't measure a trajectory's worth by it's total reward, but by how much better that total reward is relative to some expected ("baseline") reward value, denoted above by $b$. Note that subtracting a baseline has no effect on the expected value of the policy gradient. It's easy to prove this directly by definition.

Here we'll implement a very simple baseline (not optimal in terms of variance reduction): the average of the estimated state-values $\hat{q}_{i,t}$.

TODO: Complete the implementation of the BaselinePolicyGradientLoss class in the hw4/rl_pg.py module.

Entropy loss

The entropy of a probability distribution (in our case the policy), is $$ H(\pi) = -\sum_{a} \pi(a|s)\log\pi(a|s). $$ The entropy is always positive and obtains it's maximum for a uniform distribution. We'll use the entropy of the policy as a bonus, i.e. we'll try to maximize it. The idea is the prevent the policy distribution from becoming too narrow and thus promote the agent's exploration.

First, we'll calculate the maximal possible entropy value of the action distribution for a set number of possible actions. This will be used as a normalization term.

TODO: Complete the implementation of the calc_max_entropy() method in the ActionEntropyLoss class.

TODO: Complete the implementation of the forward() method in the ActionEntropyLoss class.

Training

We'll implement our training procedure as follows:

  1. Initialize the current policy to be a random policy.
  2. Sample $N$ trajectories from the environment using the current policy.
  3. Calculate the estimated $q$-values, $\hat{q}_{i,t} = \sum_{t'\geq t} \gamma^{t'}r_{i,t'+1}$ for each trajectory $i$.
  4. Calculate policy gradient estimate $\hat\grad\mathcal{L}(\vec{\theta})$ as defined above.
  5. Perform SGD update $\vec{\theta}\leftarrow\vec{\theta}-\eta\hat\grad\mathcal{L}(\vec{\theta})$.
  6. Repeat from step 2.

This is known as the REINFORCE algorithm.

Fortunately, we've already implemented everything we need for steps 1-4 so we need only a bit more code to put it all together.

The following block implements a wrapper, train_pg to create all the objects we need in order to train our policy gradient model.

The PolicyTrainer class implements the training loop, collects the losses and rewards and provides some useful checkpointing functionality. The training loop will generate batches of episodes and train on them until either:

Most of this class is already implemented for you.

TODO:

  1. Complete the training loop by implementing the train_batch() method of the PolicyTrainer.
  2. Tweak the hyperparameters in the part1_pg_hyperparams() function within the hw4/answers.py module as needed. You get some sane defaults.

Let's check whether our model is actually training. We'll try to reach a very low (bad) target reward, just as a sanity check to see that training works. Your model should be able to reach this target reward within a few batches.

You can increase the target reward and use this block to manually tweak your model and hyperparameters a few times.

Experimenting with different losses

We'll now run a few experiments to see the effect of diferent loss functions on the training dynamics. Namely, we'll try:

  1. Vanilla PG (vpg): No baseline, no entropy
  2. Baseline PG (bpg): Baseline, no entropy loss
  3. Entropy PG (epg): No baseline, with entropy loss
  4. Combined PG (cpg): Baseline, with entropy loss

We'll save the training data from each experiment for plotting.

Let's run the experiments! We'll run each configuration for a fixed number of episodes so that we can compare them.

Notes:

  1. Until your models start working, you can decrease the number of episodes for each experiment, or only run one experiment.
  2. The results will be saved in a file. To re-run the experiments, you can set force_run to True.

You should see positive training dynamics in the graphs (reward going up). If you don't, use them to further update your model or hyperparams.

To pass the test, you'll need to get a best total mean reward of at least 10 in the fixed number of epochs using the combined loss. It's possible to get much higher (over 100).

Now let's take a look at a gameplay video of our cpg model after the short training!

Advantage Actor-Critic (AAC)

We have seen that the policy-gradient loss can be interpreted as a log-likelihood of the policy term (selecting a specific action at a specific state), weighted by the future rewards of that choice of action.

However, naïvely weighting by rewards has significant drawbacks in terms of the variance of the resulting gradient. We addressed this by adding a simple baseline term which represented our "expected reward" so that we increase probability of actions leading to trajectories which exceed this expectation and vice-versa.

In this part we'll explore a more powerful baseline, which is the idea behind the AAC method.

The advantage function

Recall the definition of the state-value function $v_{\pi}(s)$ and action-value function $q_{\pi}(s,a)$:

$$ \begin{align} v_{\pi}(s) &= \E{g(\tau)|s_0 = s,\pi} \\ q_{\pi}(s,a) &= \E{g(\tau)|s_0 = s,a_0=a,\pi}. \end{align} $$

Both these functions represent the value of the state $s$. However, $v_\pi$ averages over the first action according to the policy, while $q_\pi$ fixes the first action and then continues according to the policy.

Their difference is known as the advantage function: $$ a_\pi(s,a) = q_\pi(s,a)-v_\pi(s). $$

If $a_\pi(s,a)>0$ it means that it's better (in expectation) to take action $a$ in state $s$ compared to the average action. In other words, $a_\pi(s,a)$ represents the advantage of using action $a$ in state $s$ compared to the others.

So far we have used an estimate for $q_\pi$ as our weighting term for the log-policy, with a fixed baseline per batch.

$$ \hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-b\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$

Now, we will use the state value as a baseline, so that an estimate of the advantage function is our weighting term:

$$ \hat\grad\mathcal{L}_{\text{AAC}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-v_\pi(s_t)\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$

Intuitively, using the advantage function makes sense because it means we're weighting our policy's actions according to how advantageous they are compared to other possible actions.

But how will we know $v_\pi(s)$? We'll learn it of course, using another neural network. This is known as actor-critic learning. We simultaneously learn the policy (actor) and the value of states (critic). We'll treat it as a regression task: given a state $s_t$, our state-value network will output $\hat{v}_\pi(s_t)$, an estimate of the actual unknown state-value. Our regression targets will be the discounted rewards, $\hat{q}_{i,t}$ (see question 2), and we can use a simple MSE as the loss function, $$ \mathcal{L}_{\text{SV}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0}\left(\hat{v}_\pi(s_t) - \hat{q}_{i,t}\right)^2. $$

Implementation

We'll build heavily on our implementation of the regular policy-gradient method, and just add a new model class and a new loss class, with a small modification to the agent.

Let's start with the model. It will accept a state, and return action scores (as before), but also the value of that state. You can experiment with a dual-head network that has a shared base, or implement two separate parts within the network.

TODO:

  1. Implement the model as the AACPolicyNet class in the hw4/rl_ac.py module.
  2. Set the hyperparameters in the part1_aac_hyperparams() function of the hw4.answers module.

TODO: Complete the implementation of the agent class, AACPolicyAgent, in the hw4/rl_ac.py module.

TODO: Implement the AAC loss function as the class AACPolicyGradientLoss in the hw4/rl_ac.py module.

Experimentation

Let's run the same experiment as before, but with the AAC method and compare the results.

You should get better results with the AAC method, so this time the bar is higher (again, you should aim for a mean reward of 100+). Compare the graphs with combined PG method and see if they make sense.

Final model training and visualization

Now, using your best model and hyperparams, let's train model for much longer and see the performance. Just for fun, we'll also visualize an episode every now and then so that we can see how well the agent is playing.

TODO:

Questions

TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.

Question 1

Explain qualitatively why subtracting a baseline in the policy-gradient helps reduce it's variance. Specifically, give an example where it helps.